{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Word vectors from SEC filings using Gensim: Preprocessing"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section, we will learn word and phrase vectors from annual SEC filings using gensim to illustrate the potential value of word embeddings for algorithmic trading. In the following sections, we will combine these vectors as features with price returns to train neural networks to predict equity prices from the content of security filings.\n",
    "\n",
    "In particular, we use a dataset containing over 22,000 10-K annual reports from the period 2013-2016 that are filed by listed companies and contain both financial information and management commentary (see chapter 3 on Alternative Data). For about half of these filings, roughly 11,000, we have stock prices for the filing companies, which lets us label the data for predictive modeling."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports & Settings"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:44:28.754726Z",
     "start_time": "2020-06-21T16:44:28.749484Z"
    }
   },
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:44:30.118307Z",
     "start_time": "2020-06-21T16:44:29.480278Z"
    }
   },
   "outputs": [],
   "source": [
    "from dateutil.relativedelta import relativedelta\n",
    "from pathlib import Path\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from time import time\n",
    "from collections import Counter\n",
    "import logging\n",
    "import spacy\n",
    "\n",
    "from gensim.models import Word2Vec\n",
    "from gensim.models.word2vec import LineSentence\n",
    "from gensim.models.phrases import Phrases, Phraser"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:44:30.121155Z",
     "start_time": "2020-06-21T16:44:30.119389Z"
    }
   },
   "outputs": [],
   "source": [
    "np.random.seed(42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:44:30.133327Z",
     "start_time": "2020-06-21T16:44:30.122236Z"
    }
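   },
   "outputs": [],
   "source": [
    "def format_time(t):\n",
    "    # format a duration in seconds as HH:MM:SS\n",
    "    # round to whole seconds first so the seconds field never shows 60\n",
    "    m, s = divmod(int(round(t)), 60)\n",
    "    h, m = divmod(m, 60)\n",
    "    return f'{h:02d}:{m:02d}:{s:02d}'"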
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Logging Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:44:30.737737Z",
     "start_time": "2020-06-21T16:44:30.725561Z"
    }
   },
   "outputs": [],
   "source": [
    "logging.basicConfig(\n",
    "        filename='preprocessing.log',\n",
    "        level=logging.DEBUG,\n",
    "        format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',\n",
    "        datefmt='%H:%M:%S')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Download"
   ]
  },
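  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The data can be downloaded from [here](https://drive.google.com/uc?id=0B4NK0q0tDtLFendmeHNsYzNVZ2M&export=download). Unzip the archive, move it into the `data` folder in the repository's root directory, and rename it to `filings`. Note that the paths defined below look for the text files in `data/sec-filings/filings`, so adjust either the location or `sec_path` accordingly."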
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Paths"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each filing is a separate text file, and a master index contains the filing metadata. We extract the most informative sections, namely:\n",
    "- Item 1 and 1A: Business and Risk Factors\n",
    "- Item 7 and 7A: Management's Discussion and Disclosures about Market Risks\n",
    "\n",
    "This notebook shows how to parse and tokenize the text using spaCy, similar to the approach in chapter 14. We do not lemmatize the tokens, in order to preserve nuances of word usage.\n",
    "\n",
    "We use gensim to detect phrases. The Phrases module scores the tokens and the Phraser class transforms the text data accordingly. The notebook shows how to repeat the process to create longer phrases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:44:34.329500Z",
     "start_time": "2020-06-21T16:44:34.321563Z"
    }
   },
   "outputs": [],
   "source": [
    "sec_path = Path('..', 'data', 'sec-filings')\n",
    "filing_path = sec_path / 'filings'\n",
    "sections_path = sec_path / 'sections'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:44:34.912104Z",
     "start_time": "2020-06-21T16:44:34.903535Z"
    }
   },
   "outputs": [],
   "source": [
    "if not sections_path.exists():\n",
    "    sections_path.mkdir(exist_ok=True, parents=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Identify Sections"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T02:52:25.427816Z",
     "start_time": "2020-06-21T02:44:04.864855Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "500 1000 1500 2000 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 7500 8000 8500 9000 9500 10000 10500 11000 11500 12000 12500 13000 13500 14000 14500 15000 15500 16000 16500 17000 17500 18000 18500 19000 19500 20000 20500 21000 21500 22000 22500 "
     ]
    }
   ],
   "source": [
    "for i, filing in enumerate(filing_path.glob('*.txt'), 1):\n",
    "    if i % 500 == 0:\n",
    "        print(i, end=' ', flush=True)\n",
    "    filing_id = int(filing.stem)\n",
    "    items = {}\n",
    "    for section in filing.read_text().lower().split('°'):\n",
    "        if section.startswith('item '):\n",
    "            if len(section.split()) > 1:\n",
    "                item = section.split()[1].replace('.', '').replace(':', '').replace(',', '')\n",
    "                text = ' '.join([t for t in section.split()[2:]])\n",
    "                if items.get(item) is None or len(items.get(item)) < len(text):\n",
    "                    items[item] = text\n",
    "\n",
    "    txt = pd.Series(items).reset_index()\n",
    "    txt.columns = ['item', 'text']\n",
    "    txt.to_csv(sections_path / (filing.stem + '.csv'), index=False)"
   ]
  },
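  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The loop above splits each filing on the `°` delimiter, keeps the chunks that start with `item `, normalizes the item label, and retains the longest text per label, presumably so that a short table-of-contents entry does not replace the full section. A minimal sketch of that logic on a made-up string follows; the sample text and the helper name `extract_items` are illustrative assumptions, not part of the dataset or the code above.\n",
    "\n",
    "```python\n",
    "def extract_items(raw, delimiter='°'):\n",
    "    # map item labels such as '1a' to the longest matching chunk of text\n",
    "    items = {}\n",
    "    for section in raw.lower().split(delimiter):\n",
    "        words = section.split()\n",
    "        # keep only chunks that start with 'item ' and carry a label\n",
    "        if not section.startswith('item ') or len(words) < 2:\n",
    "            continue\n",
    "        # normalize the label, e.g. '1a.' -> '1a'\n",
    "        label = words[1].replace('.', '').replace(':', '').replace(',', '')\n",
    "        text = ' '.join(words[2:])\n",
    "        # keep the longest text seen for each label\n",
    "        if label not in items or len(items[label]) < len(text):\n",
    "            items[label] = text\n",
    "    return items\n",
    "\n",
    "\n",
    "# hypothetical sample: a table-of-contents entry followed by the actual section\n",
    "sample = 'Item 1. Business°Item 1A. Risk Factors°Item 1. Business We design and sell widgets.'\n",
    "items = extract_items(sample)\n",
    "```\n",
    "\n",
    "Here `items` maps `'1'` to the longer business description and `'1a'` to `'risk factors'`."
   ]
  },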
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Parse Sections"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Select the following sections:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T02:52:25.430613Z",
     "start_time": "2020-06-21T02:52:25.428829Z"
    }
   },
   "outputs": [],
   "source": [
    "sections = ['1', '1a', '7', '7a']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:45:03.290044Z",
     "start_time": "2020-06-21T16:45:03.287811Z"
    }
   },
   "outputs": [],
   "source": [
    "clean_path = sec_path / 'selected_sections'\n",
    "if not clean_path.exists():\n",
    "    clean_path.mkdir(exist_ok=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T03:24:54.421552Z",
     "start_time": "2020-06-21T03:24:53.708348Z"
    }
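   },
   "outputs": [],
   "source": [
    "# the 'en' shortcut works with spaCy 2.x; spaCy 3+ requires a full model name such as 'en_core_web_sm'\n",
    "nlp = spacy.load('en', disable=['ner'])\n",
    "# raise the character limit so that very long sections can be parsed\n",
    "nlp.max_length = 6000000"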
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T12:05:56.150635Z",
     "start_time": "2020-06-21T03:25:21.267371Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  100\t00:02:38\t18,125\t09:53:45\n",
      "  200\t00:05:36\t17,183\t10:28:08\n",
      "  300\t00:08:30\t16,514\t10:32:46\n",
      "  400\t00:10:57\t17,093\t10:08:36\n",
      "  500\t00:13:21\t17,482\t09:50:42\n",
      "  600\t00:15:56\t17,806\t09:45:08\n",
      "  700\t00:18:33\t18,003\t09:41:23\n",
      "  800\t00:20:46\t18,139\t09:26:55\n",
      "  900\t00:23:07\t18,262\t09:18:06\n",
      " 1000\t00:25:33\t18,342\t09:12:43\n",
      " 1100\t00:27:51\t18,425\t09:05:06\n",
      " 1200\t00:30:27\t18,486\t09:03:41\n",
      " 1300\t00:33:05\t18,536\t09:02:49\n",
      " 1400\t00:35:36\t18,579\t08:59:47\n",
      " 1500\t00:38:15\t18,621\t08:58:47\n",
      " 1600\t00:40:39\t18,666\t08:54:19\n",
      " 1700\t00:42:57\t18,714\t08:48:44\n",
      " 1800\t00:45:36\t18,759\t08:47:41\n",
      " 1900\t00:47:52\t18,805\t08:42:17\n",
      " 2000\t00:50:14\t18,853\t08:38:10\n",
      " 2100\t00:52:23\t18,879\t08:32:06\n",
      " 2200\t00:54:43\t18,908\t08:28:11\n",
      " 2300\t00:57:17\t18,908\t08:26:20\n",
      " 2400\t00:59:48\t18,834\t08:24:02\n",
      " 2500\t01:01:56\t18,868\t08:18:43\n",
      " 2600\t01:04:21\t18,898\t08:15:43\n",
      " 2700\t01:06:23\t18,924\t08:10:03\n",
      " 2800\t01:08:29\t18,951\t08:05:05\n",
      " 2900\t01:10:42\t18,981\t08:01:03\n",
      " 3000\t01:12:49\t19,008\t07:56:28\n",
      " 3100\t01:15:12\t19,030\t07:53:45\n",
      " 3200\t01:17:46\t19,052\t07:52:11\n",
      " 3300\t01:20:04\t19,074\t07:49:02\n",
      " 3400\t01:22:17\t19,098\t07:45:27\n",
      " 3500\t01:24:32\t19,117\t07:42:03\n",
      " 3600\t01:26:45\t19,134\t07:38:35\n",
      " 3700\t01:28:52\t19,151\t07:34:43\n",
      " 3800\t01:31:00\t19,167\t07:30:58\n",
      " 3900\t01:33:15\t19,189\t07:27:52\n",
      " 4000\t01:35:44\t19,204\t07:25:53\n",
      " 4100\t01:37:58\t19,217\t07:22:49\n",
      " 4200\t01:40:21\t19,227\t07:20:22\n",
      " 4300\t01:42:40\t19,241\t07:17:39\n",
      " 4400\t01:45:03\t19,250\t07:15:15\n",
      " 4500\t01:47:19\t19,258\t07:12:23\n",
      " 4600\t01:49:34\t19,269\t07:09:30\n",
      " 4700\t01:51:51\t19,277\t07:06:45\n",
      " 4800\t01:53:55\t19,286\t07:03:12\n",
      " 4900\t01:56:07\t19,298\t07:00:11\n",
      " 5000\t01:58:29\t19,305\t06:57:47\n",
      " 5100\t02:00:38\t19,316\t06:54:42\n",
      " 5200\t02:02:47\t19,322\t06:51:34\n",
      " 5300\t02:04:53\t19,328\t06:48:21\n",
      " 5400\t02:07:18\t19,336\t06:46:11\n",
      " 5500\t02:09:56\t19,347\t06:44:43\n",
      " 5600\t02:12:05\t19,354\t06:41:42\n",
      " 5700\t02:14:05\t19,359\t06:38:18\n",
      " 5800\t02:16:29\t19,368\t06:36:04\n",
      " 5900\t02:18:34\t19,378\t06:32:57\n",
      " 6000\t02:20:58\t19,382\t06:30:43\n",
      " 6100\t02:23:23\t19,388\t06:28:33\n",
      " 6200\t02:25:33\t19,396\t06:25:44\n",
      " 6300\t02:27:41\t19,405\t06:22:51\n",
      " 6400\t02:30:02\t19,412\t06:20:31\n",
      " 6500\t02:32:16\t19,418\t06:17:53\n",
      " 6600\t02:34:22\t19,424\t06:14:57\n",
      " 6700\t02:36:32\t19,430\t06:12:13\n",
      " 6800\t02:38:56\t19,432\t06:10:01\n",
      " 6900\t02:41:05\t19,433\t06:07:15\n",
      " 7000\t02:43:38\t19,436\t06:05:23\n",
      " 7100\t02:46:09\t19,439\t06:03:27\n",
      " 7200\t02:48:23\t19,443\t06:00:53\n",
      " 7300\t02:50:49\t19,442\t05:58:44\n",
      " 7400\t02:53:09\t19,444\t05:56:24\n",
      " 7500\t02:55:20\t19,448\t05:53:45\n",
      " 7600\t02:57:48\t19,452\t05:51:38\n",
      " 7700\t03:00:12\t19,451\t05:49:26\n",
      " 7800\t03:02:39\t19,453\t05:47:18\n",
      " 7900\t03:04:59\t19,456\t05:44:57\n",
      " 8000\t03:06:55\t19,460\t05:41:51\n",
      " 8100\t03:09:12\t19,464\t05:39:26\n",
      " 8200\t03:11:20\t19,469\t05:36:44\n",
      " 8300\t03:13:31\t19,473\t05:34:09\n",
      " 8400\t03:15:45\t19,478\t05:31:38\n",
      " 8500\t03:18:02\t19,483\t05:29:13\n",
      " 8600\t03:20:21\t19,488\t05:26:52\n",
      " 8700\t03:22:28\t19,494\t05:24:12\n",
      " 8800\t03:24:48\t19,498\t05:21:53\n",
      " 8900\t03:27:06\t19,504\t05:19:31\n",
      " 9000\t03:29:28\t19,511\t05:17:16\n",
      " 9100\t03:31:41\t19,514\t05:14:45\n",
      " 9200\t03:33:53\t19,518\t05:12:14\n",
      " 9300\t03:36:15\t19,521\t05:09:59\n",
      " 9400\t03:38:35\t19,528\t05:07:41\n",
      " 9500\t03:40:50\t19,534\t05:05:14\n",
      " 9600\t03:43:02\t19,539\t05:02:45\n",
      " 9700\t03:45:23\t19,539\t05:00:28\n",
      " 9800\t03:47:45\t19,541\t04:58:12\n",
      " 9900\t03:49:56\t19,545\t04:55:41\n",
      "10000\t03:51:60\t19,549\t04:53:02\n",
      "10100\t03:54:13\t19,553\t04:50:36\n",
      "10200\t03:56:37\t19,558\t04:48:22\n",
      "10300\t03:59:01\t19,562\t04:46:09\n",
      "10400\t04:01:29\t19,566\t04:44:00\n",
      "10500\t04:03:49\t19,568\t04:41:41\n",
      "10600\t04:06:03\t19,573\t04:39:16\n",
      "10700\t04:08:28\t19,577\t04:37:04\n",
      "10800\t04:10:45\t19,581\t04:34:41\n",
      "10900\t04:13:10\t19,585\t04:32:28\n",
      "11000\t04:15:13\t19,588\t04:29:51\n",
      "11100\t04:17:46\t19,592\t04:27:47\n",
      "11200\t04:20:06\t19,593\t04:25:27\n",
      "11300\t04:22:25\t19,596\t04:23:09\n",
      "11400\t04:24:29\t19,599\t04:20:34\n",
      "11500\t04:26:39\t19,603\t04:18:06\n",
      "11600\t04:29:06\t19,605\t04:15:54\n",
      "11700\t04:31:25\t19,609\t04:13:35\n",
      "11800\t04:33:43\t19,612\t04:11:14\n",
      "11900\t04:35:54\t19,613\t04:08:48\n",
      "12000\t04:38:21\t19,617\t04:06:36\n",
      "12100\t04:40:38\t19,619\t04:04:15\n",
      "12200\t04:43:09\t19,621\t04:02:05\n",
      "12300\t04:45:18\t19,623\t03:59:37\n",
      "12400\t04:47:37\t19,626\t03:57:18\n",
      "12500\t04:49:53\t19,629\t03:54:57\n",
      "12600\t04:51:59\t19,631\t03:52:27\n",
      "12700\t04:54:05\t19,634\t03:49:58\n",
      "12800\t04:56:16\t19,636\t03:47:33\n",
      "12900\t04:58:43\t19,639\t03:45:20\n",
      "13000\t05:01:04\t19,643\t03:43:02\n",
      "13100\t05:03:13\t19,646\t03:40:36\n",
      "13200\t05:05:34\t19,648\t03:38:19\n",
      "13300\t05:07:57\t19,650\t03:36:03\n",
      "13400\t05:10:23\t19,652\t03:33:49\n",
      "13500\t05:12:43\t19,654\t03:31:31\n",
      "13600\t05:14:42\t19,657\t03:28:59\n",
      "13700\t05:17:04\t19,658\t03:26:42\n",
      "13800\t05:19:11\t19,662\t03:24:15\n",
      "13900\t05:21:22\t19,665\t03:21:51\n",
      "14000\t05:23:52\t19,668\t03:19:40\n",
      "14100\t05:26:12\t19,669\t03:17:22\n",
      "14200\t05:28:30\t19,671\t03:15:03\n",
      "14300\t05:30:26\t19,674\t03:12:30\n",
      "14400\t05:32:47\t19,676\t03:10:13\n",
      "14500\t05:35:05\t19,679\t03:07:54\n",
      "14600\t05:37:30\t19,682\t03:05:39\n",
      "14700\t05:39:50\t19,685\t03:03:21\n",
      "14800\t05:41:54\t19,689\t03:00:54\n",
      "14900\t05:44:06\t19,692\t02:58:32\n",
      "15000\t05:46:16\t19,694\t02:56:10\n",
      "15100\t05:48:32\t19,696\t02:53:50\n",
      "15200\t05:50:45\t19,698\t02:51:29\n",
      "15300\t05:52:56\t19,700\t02:49:06\n",
      "15400\t05:55:12\t19,702\t02:46:47\n",
      "15500\t05:57:27\t19,704\t02:44:27\n",
      "15600\t05:59:48\t19,704\t02:42:10\n",
      "15700\t06:01:60\t19,705\t02:39:48\n",
      "15800\t06:04:10\t19,707\t02:37:27\n",
      "15900\t06:06:17\t19,708\t02:35:04\n",
      "16000\t06:08:18\t19,710\t02:32:38\n",
      "16100\t06:10:21\t19,713\t02:30:14\n",
      "16200\t06:12:48\t19,713\t02:27:60\n",
      "16300\t06:14:58\t19,715\t02:25:38\n",
      "16400\t06:17:09\t19,714\t02:23:18\n",
      "16500\t06:19:26\t19,715\t02:20:59\n",
      "16600\t06:21:36\t19,717\t02:18:38\n",
      "16700\t06:23:39\t19,719\t02:16:15\n",
      "16800\t06:25:57\t19,720\t02:13:57\n",
      "16900\t06:28:29\t19,721\t02:11:44\n",
      "17000\t06:30:55\t19,723\t02:09:29\n",
      "17100\t06:33:19\t19,725\t02:07:13\n",
      "17200\t06:35:40\t19,725\t02:04:56\n",
      "17300\t06:37:54\t19,727\t02:02:37\n",
      "17400\t06:40:24\t19,728\t02:00:22\n",
      "17500\t06:42:43\t19,730\t01:58:05\n",
      "17600\t06:44:57\t19,733\t01:55:45\n",
      "17700\t06:47:14\t19,734\t01:53:27\n",
      "17800\t06:49:27\t19,735\t01:51:08\n",
      "17900\t06:51:35\t19,737\t01:48:47\n",
      "18000\t06:53:39\t19,738\t01:46:25\n",
      "18100\t06:55:52\t19,740\t01:44:06\n",
      "18200\t06:58:01\t19,741\t01:41:46\n",
      "18300\t07:00:07\t19,741\t01:39:26\n",
      "18400\t07:02:40\t19,739\t01:37:11\n",
      "18500\t07:04:53\t19,737\t01:34:52\n",
      "18600\t07:07:11\t19,736\t01:32:35\n",
      "18700\t07:09:44\t19,735\t01:30:20\n",
      "18800\t07:12:01\t19,733\t01:28:02\n",
      "18900\t07:14:23\t19,731\t01:25:45\n",
      "19000\t07:16:51\t19,730\t01:23:29\n",
      "19100\t07:19:04\t19,730\t01:21:10\n",
      "19200\t07:21:27\t19,728\t01:18:53\n",
      "19300\t07:23:60\t19,725\t01:16:38\n",
      "19400\t07:26:18\t19,726\t01:14:20\n",
      "19500\t07:28:50\t19,724\t01:12:04\n",
      "19600\t07:31:13\t19,724\t01:09:47\n",
      "19700\t07:33:37\t19,722\t01:07:29\n",
      "19800\t07:35:47\t19,721\t01:05:10\n",
      "19900\t07:38:20\t19,719\t01:02:54\n",
      "20000\t07:40:44\t19,719\t01:00:37\n",
      "20100\t07:42:56\t19,720\t00:58:18\n",
      "20200\t07:45:08\t19,720\t00:55:59\n",
      "20300\t07:47:17\t19,720\t00:53:39\n",
      "20400\t07:49:17\t19,720\t00:51:19\n",
      "20500\t07:51:35\t19,721\t00:49:01\n",
      "20600\t07:53:55\t19,720\t00:46:43\n",
      "20700\t07:56:16\t19,720\t00:44:26\n",
      "20800\t07:58:14\t19,719\t00:42:06\n",
      "20900\t08:00:33\t19,719\t00:39:48\n",
      "21000\t08:02:53\t19,718\t00:37:30\n",
      "21100\t08:04:54\t19,718\t00:35:11\n",
      "21200\t08:07:18\t19,717\t00:32:54\n",
      "21300\t08:09:40\t19,716\t00:30:36\n",
      "21400\t08:11:56\t19,716\t00:28:18\n",
      "21500\t08:14:09\t19,716\t00:25:60\n",
      "21600\t08:16:32\t19,715\t00:23:42\n",
      "21700\t08:18:49\t19,715\t00:21:24\n",
      "21800\t08:21:02\t19,714\t00:19:06\n",
      "21900\t08:23:18\t19,714\t00:16:48\n",
      "22000\t08:25:44\t19,713\t00:14:30\n",
      "22100\t08:28:19\t19,711\t00:12:13\n",
      "22200\t08:30:33\t19,709\t00:09:55\n",
      "22300\t08:33:16\t19,698\t00:07:37\n",
      "22400\t08:35:44\t19,696\t00:05:19\n",
      "22500\t08:38:06\t19,704\t00:03:01\n",
      "22600\t08:39:59\t19,712\t00:00:43\n"
     ]
    }
   ],
   "source": [
    "vocab = Counter()\n",
    "total_tokens = 0\n",
    "stats = []\n",
    "\n",
    "start = time()\n",
    "to_do = len(list(sections_path.glob('*.csv')))\n",
    "done = len(list(clean_path.glob('*.csv'))) + 1\n",
    "for text_file in sections_path.glob('*.csv'):\n",
    "    file_id = int(text_file.stem)\n",
    "    clean_file = clean_path / f'{file_id}.csv'\n",
    "    if clean_file.exists():\n",
    "        continue\n",
    "    items = pd.read_csv(text_file).dropna()\n",
    "    items.item = items.item.astype(str)\n",
    "    items = items[items.item.isin(sections)]\n",
    "    # progress columns: files done, elapsed time, tokens per second, estimated time remaining\n",
    "    if done % 100 == 0:\n",
    "        duration = time() - start\n",
    "        to_go = (to_do - done) * duration / done\n",
    "        print(f'{done:>5}\\t{format_time(duration)}\\t{total_tokens / duration:,.0f}\\t{format_time(to_go)}')\n",
    "    \n",
    "    clean_doc = []\n",
    "    for _, (item, text) in items.iterrows():\n",
    "        doc = nlp(text)\n",
    "        for s, sentence in enumerate(doc.sents):\n",
    "            clean_sentence = []\n",
    "            if sentence is not None:\n",
    "                for token in sentence:\n",
    "                    # keep alphabetic tokens only; drop stop words, numbers, punctuation, whitespace, and pronouns\n",
    "                    if not any([token.is_stop,\n",
    "                                token.is_digit,\n",
    "                                not token.is_alpha,\n",
    "                                token.is_punct,\n",
    "                                token.is_space,\n",
    "                                token.lemma_ == '-PRON-',\n",
    "                                token.pos_ in ['PUNCT', 'SYM', 'X']]):\n",
    "                        clean_sentence.append(token.text.lower())\n",
    "                # count every token in the sentence, not only the retained ones\n",
    "                total_tokens += len(sentence)\n",
    "                if len(clean_sentence) > 0:\n",
    "                    clean_doc.append([item, s, ' '.join(clean_sentence)])\n",
    "    (pd.DataFrame(clean_doc,\n",
    "                  columns=['item', 'sentence', 'text'])\n",
    "     .dropna()\n",
    "     .to_csv(clean_file, index=False))\n",
    "    done += 1"
   ]
  },
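  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The filter above depends on spaCy token attributes such as `is_stop` and `is_alpha`. To get a feel for its effect without loading a language model, the sketch below approximates it with plain string methods. The stop-word set and the helper `clean_tokens` are hypothetical stand-ins; spaCy's tokenizer and stop list will produce different results on real text.\n",
    "\n",
    "```python\n",
    "# tiny illustrative stop list (an assumption); spaCy ships a much larger one\n",
    "STOP_WORDS = {'the', 'of', 'and', 'to', 'in', 'by', 'our', 'we', 'a'}\n",
    "\n",
    "\n",
    "def clean_tokens(sentence):\n",
    "    # rough stand-in for the spaCy filter: strip surrounding punctuation,\n",
    "    # then keep lowercased alphabetic tokens that are not stop words\n",
    "    tokens = [token.strip('.,;:()') for token in sentence.split()]\n",
    "    return [t.lower() for t in tokens\n",
    "            if t.isalpha() and t.lower() not in STOP_WORDS]\n",
    "\n",
    "\n",
    "sentence = 'Our revenue increased by 12% in 2015, driven by cloud services.'\n",
    "cleaned = clean_tokens(sentence)\n",
    "```\n",
    "\n",
    "For this sentence, `cleaned` is `['revenue', 'increased', 'driven', 'cloud', 'services']`: numbers, the percentage, and the stop words are gone, while the remaining words keep their original (non-lemmatized) form."
   ]
  },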
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create ngrams"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:45:07.927900Z",
     "start_time": "2020-06-21T16:45:07.918296Z"
    }
   },
   "outputs": [],
   "source": [
    "ngram_path = sec_path / 'ngrams'\n",
    "stats_path = sec_path / 'corpus_stats'\n",
    "for path in [ngram_path, stats_path]:\n",
    "    if not path.exists():\n",
    "        path.mkdir(parents=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T13:32:42.332829Z",
     "start_time": "2020-06-21T13:32:42.331171Z"
    }
   },
   "outputs": [],
   "source": [
    "unigrams = ngram_path / 'ngrams_1.txt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T13:32:46.109338Z",
     "start_time": "2020-06-21T13:32:46.104347Z"
    }
   },
   "outputs": [],
   "source": [
    "def create_unigrams(min_length=3):\n",
    "    texts = []\n",
    "    sentence_counter = Counter()\n",
    "    vocab = Counter()\n",
    "    for i, f in enumerate(clean_path.glob('*.csv')):\n",
    "        if i % 1000 == 0:\n",
    "            print(i, end=' ', flush=True)\n",
    "        df = pd.read_csv(f)\n",
    "        df.item = df.item.astype(str)\n",
    "        df = df[df.item.isin(sections)]\n",
    "        sentence_counter.update(df.groupby('item').size().to_dict())\n",
    "        for sentence in df.text.dropna().str.split().tolist():\n",
    "            if len(sentence) >= min_length:\n",
    "                vocab.update(sentence)\n",
    "                texts.append(' '.join(sentence))\n",
    "    \n",
    "    (pd.DataFrame(sentence_counter.most_common(), \n",
    "                  columns=['item', 'sentences'])\n",
    "     .to_csv(stats_path / 'selected_sentences.csv', index=False))\n",
    "    (pd.DataFrame(vocab.most_common(), columns=['token', 'n'])\n",
    "     .to_csv(stats_path / 'sections_vocab.csv', index=False))\n",
    "    \n",
    "    unigrams.write_text('\\n'.join(texts))\n",
    "    return [l.split() for l in texts]"
   ]
  },
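  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`create_unigrams` keeps only sentences with at least `min_length` tokens and counts tokens over the retained sentences. A small sketch of that filtering step on made-up sentences (the sample data is an assumption for illustration only):\n",
    "\n",
    "```python\n",
    "from collections import Counter\n",
    "\n",
    "# hypothetical cleaned sentences, one string per sentence\n",
    "sentences = ['revenue increased due higher cloud sales',\n",
    "             'see note',\n",
    "             'risk factors include interest rate changes']\n",
    "\n",
    "min_length = 3\n",
    "vocab = Counter()\n",
    "texts = []\n",
    "for sentence in (s.split() for s in sentences):\n",
    "    # drop sentences with fewer than min_length tokens\n",
    "    if len(sentence) >= min_length:\n",
    "        vocab.update(sentence)\n",
    "        texts.append(' '.join(sentence))\n",
    "```\n",
    "\n",
    "The two-token sentence is dropped, so `texts` holds two sentences and `vocab` contains no entry for `note`."
   ]
  },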
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T13:37:06.760726Z",
     "start_time": "2020-06-21T13:32:52.427226Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000 20000 21000 22000 \n",
      "Reading:  00:04:14\n"
     ]
    }
   ],
   "source": [
    "start = time()\n",
    "if not unigrams.exists():\n",
    "    texts = create_unigrams()\n",
    "else:\n",
    "    texts = [l.split() for l in unigrams.open()]\n",
    "print('\\nReading: ', format_time(time() - start))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T13:37:06.775365Z",
     "start_time": "2020-06-21T13:37:06.762138Z"
    }
   },
   "outputs": [],
   "source": [
    "def create_ngrams(max_length=3):\n",
    "    \"\"\"Use gensim to create ngrams\"\"\"\n",
    "\n",
    "    n_grams = pd.DataFrame()\n",
    "    start = time()\n",
    "    for n in range(2, max_length + 1):\n",
    "        print(n, end=' ', flush=True)\n",
    "\n",
    "        sentences = LineSentence(ngram_path / f'ngrams_{n - 1}.txt')\n",
    "        phrases = Phrases(sentences=sentences,\n",
    "                          min_count=25,  # ignore terms with a lower count\n",
    "                          threshold=0.5,  # accept phrases with a higher score; NPMI ranges from -1 to 1\n",
    "                          max_vocab_size=40000000,  # prune less common words to limit memory use\n",
    "                          delimiter=b'_',  # how to join ngram tokens\n",
    "                          progress_per=50000,  # log progress every 50,000 sentences\n",
    "                          scoring='npmi')\n",
    "\n",
    "        s = pd.DataFrame([[k.decode('utf-8'), v] for k, v in phrases.export_phrases(sentences)], \n",
    "                         columns=['phrase', 'score']).assign(length=n)\n",
    "\n",
    "        n_grams = pd.concat([n_grams, s])\n",
    "        grams = Phraser(phrases)\n",
    "        sentences = grams[sentences]\n",
    "        (ngram_path / f'ngrams_{n}.txt').write_text('\\n'.join([' '.join(s) for s in sentences]))\n",
    "\n",
    "    n_grams = n_grams.sort_values('score', ascending=False)\n",
    "    n_grams.phrase = n_grams.phrase.str.replace('_', ' ')\n",
    "    n_grams['ngram'] = n_grams.phrase.str.replace(' ', '_')\n",
    "\n",
    "    n_grams.to_parquet(sec_path / 'ngrams.parquet')\n",
    "\n",
    "    print('\\n\\tDuration: ', format_time(time() - start))\n",
    "    print('\\tngrams: {:,d}\\n'.format(len(n_grams)))\n",
    "    print(n_grams.groupby('length').size())"
   ]
  },
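  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "With `scoring='npmi'`, gensim scores candidate pairs by normalized pointwise mutual information, which ranges from -1 (the words never co-occur) to 1 (they always co-occur), so `threshold=0.5` keeps only fairly strongly associated pairs. The sketch below computes the score from raw counts. The function name and all counts are made up for illustration, and gensim's own scorer may handle edge cases, such as counts below `min_count`, differently.\n",
    "\n",
    "```python\n",
    "from math import log\n",
    "\n",
    "\n",
    "def npmi(count_a, count_b, count_ab, corpus_word_count):\n",
    "    # probabilities estimated from raw counts\n",
    "    pa = count_a / corpus_word_count\n",
    "    pb = count_b / corpus_word_count\n",
    "    pab = count_ab / corpus_word_count\n",
    "    # pointwise mutual information, normalized by -log p(a, b)\n",
    "    return log(pab / (pa * pb)) / -log(pab)\n",
    "\n",
    "\n",
    "# two words that only ever appear together\n",
    "always_together = npmi(count_a=100, count_b=100, count_ab=100,\n",
    "                       corpus_word_count=1_000_000)\n",
    "# two frequent words that co-occur about as often as chance predicts\n",
    "near_chance = npmi(count_a=10_000, count_b=10_000, count_ab=100,\n",
    "                   corpus_word_count=1_000_000)\n",
    "```\n",
    "\n",
    "Up to floating-point error, `always_together` equals 1 and `near_chance` equals 0, so only the first pair would pass a threshold of 0.5."
   ]
  },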
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "start_time": "2020-06-21T13:33:01.712Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "2 3 "
     ]
    }
   ],
   "source": [
    "create_ngrams()"
   ]
  },
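  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each pass of `create_ngrams` reads the file written by the previous pass, so tokens already joined into bigrams can be joined again into longer phrases. The pure-Python sketch below mimics that idea with a fixed set of accepted pairs instead of learned scores. `merge_pairs` and the pair sets are hypothetical and only illustrate the iteration, not gensim's implementation.\n",
    "\n",
    "```python\n",
    "def merge_pairs(tokens, accepted, delimiter='_'):\n",
    "    # greedily join adjacent tokens whose pair is in the accepted set\n",
    "    merged, i = [], 0\n",
    "    while i < len(tokens):\n",
    "        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in accepted:\n",
    "            merged.append(tokens[i] + delimiter + tokens[i + 1])\n",
    "            i += 2\n",
    "        else:\n",
    "            merged.append(tokens[i])\n",
    "            i += 1\n",
    "    return merged\n",
    "\n",
    "\n",
    "sentence = 'changes interest rate risk affect results'.split()\n",
    "# first pass: two unigrams become a bigram\n",
    "pass_1 = merge_pairs(sentence, {('interest', 'rate')})\n",
    "# second pass: the bigram and a following unigram become a trigram\n",
    "pass_2 = merge_pairs(pass_1, {('interest_rate', 'risk')})\n",
    "```\n",
    "\n",
    "After the second pass the sentence contains the single token `interest_rate_risk`. Note that a second pass can also join two bigrams into a four-token phrase, so the `length` column above records the pass in which a phrase was found rather than a strict token count."
   ]
  },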
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Inspect Corpus"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:48:25.955300Z",
     "start_time": "2020-06-21T16:48:25.948615Z"
    }
   },
   "outputs": [],
   "source": [
    "percentiles=np.arange(.1, 1, .1).round(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:47:22.167784Z",
     "start_time": "2020-06-21T16:45:12.067631Z"
    }
   },
   "outputs": [],
   "source": [
    "nsents, ntokens = Counter(), Counter()\n",
    "for f in clean_path.glob('*.csv'):\n",
    "    df = pd.read_csv(f)\n",
    "    nsents.update({str(k): v for k, v in df.item.value_counts().to_dict().items()})\n",
    "    df['ntokens'] = df.text.str.split().str.len()\n",
    "    ntokens.update({str(k): v for k, v in df.groupby('item').ntokens.sum().to_dict().items()})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:47:22.173106Z",
     "start_time": "2020-06-21T16:47:22.168794Z"
    }
   },
   "outputs": [],
   "source": [
    "ntokens = pd.DataFrame(ntokens.most_common(), columns=['Item', '# Tokens'])\n",
    "nsents = pd.DataFrame(nsents.most_common(), columns=['Item', '# Sentences'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:47:22.488178Z",
     "start_time": "2020-06-21T16:47:22.174110Z"
    }
   },
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAYEAAAERCAYAAACdPxtnAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjEsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+j8jraAAAciUlEQVR4nO3df5RVdb3/8eeLGYgfErqkrsp4HWoJSIADjYCafoX8AWpxvwbfJZomaIQ/0uzmNypTb+patvQqVyWJiEv29eIqxbIrV0llFMkfTDCB/JAQSScsFcVERBx8f/84h1mH4cycM3DOGQ779VhrlrPP/pzPec+W2a/Zn733ZysiMDOzZOrU0QWYmVnHcQiYmSWYQ8DMLMEcAmZmCeYQMDNLMIeAmVmCFS0EJM2R9IakF/Ns/38krZa0StJ/FasuM7NCa8/+TtI/S1okabmkFZLOLEWNrdZTrPsEJJ0MbAXujYhBOdoeDfwKGB0R70j6dES8UZTCzMwKrJ37u1nA8oi4R9JAYEFEVJegzKyKdiQQEU8Db2e+Jumzkh6V9EdJiyUNSK/6OjAjIt5Jv9cBYGZlo537uwA+mf6+F7CphKXuobLEnzcLmBoRf5Y0AvgJMBroByBpCVAB3BARj5a4NjOzQmptf3cDsFDSN4EewKkdV2IJQ0DSQcAJwK8l7Xr5Exl1HA2cAlQBiyUNiogtparPzKxQcuzvJgJzI+LfJR0P/DK9v/u4A0ot6ZFAJ2BLRNRkWdcIPBcRHwGvSHqJVCgsLWF9ZmaF0tb+7mJgDEBEPCupK9Ab6JBh8JJdIhoR/yC1g58AoJRj06t/A4xKv96b1PDQhlLVZmZWSDn2d68CX0y/fgzQFXizQwqluJeIzgOeBfpLapR0MXA+cLGkPwGrgHHp5o8BmyWtBhYB10TE5mLVZmZWSO3c3/0r8PX06/OAi6IDp3Mu2iWiZma2/8vrSEDS1embuF6UNC89hmVmZmUu55GApD7AM8DAiPhA0q9I3dwwt7X3dOrUKbp161bQQs3MDmTbtm2LiCj5VD75Xh1UCXST9BHQnRw3N3Tr1o33339/X2szM0sMSR90xOfmTJ2I+CtwG6kz2q8D70bEwpbtJE2RVC+pvqmpqfCVmplZweUMAUmHkDqr3Rc4Augh6ast20XErIiojYjayspS34hsZmZ7I5/xp1OBVyLizfTNXPNJ3QlnZmZlLp8/2V8FRkrqDnxA6iaH+qJWZWZ7+Oijj2hsbGT79u0dXYrtg65du1JVVUXnzp07uhQgjxCIiOclPQAsA5qA5aQmRjKzEmpsbKRnz55UV1eTMR+NlZGIYPPmzTQ2NtK3b9+OLgfI8z6BiLg+IgZExKCIuCAiPix2YWa2u+3bt3PooYc6AMqYJA499ND96mjOj5c0KyMOgPK3v/0/dAiYmSWYr+U0K1PV0x4paH8bbzkr77bf+973OOOMM9iyZQtr165l2rRpe7R56aWX+MY3vsGWLVv48MMPOemkk5g1a+9OJ06fPp0pU6bQvXv3vXq/ta5sQqDg/+C7nlfQ/gC44d3C92m2H3r++ee57rrr+P73v8/48eOztrnyyiu5+uqrGTcuNXnmypUr9/rzpk+fzle/+lWHQBF4OMjM8nbNNdcwZMgQli5dyvHHH8/s2bO59NJL+dGPfrRH29dff52qqqrm5cGDBwOwc+dOrrnmGo477jiGDBnCT3/6UwDq6uo45ZRTGD9+PAMGDOD8888nIrjzzjvZtGkTo0aNYtSoUQAsXLiQ448/nmHDhjFhwgS2bt0KQHV1Nddffz3Dhg1j8ODBrF27FoCtW7cyadIkBg8ezJAhQ3jwwQfb7GfatGkMHDiQIUOG8J3vfKdIW3P/4BAws7zdeuutzJ49m4suuoilS5cyZMgQVqxYwXXXXbdH26uvvprRo0czduxY7rjjDrZsST0t9uc//zm9evVi6dKl
LF26lJ/97Ge88sorACxfvpzp06ezevVqNmzYwJIlS7jyyis54ogjWLRoEYsWLeKtt97ipptu4vHHH2fZsmXU1tZy++23N39u7969WbZsGZdeeim33XYbADfeeCO9evVi5cqVrFixgtGjR7faz9tvv81DDz3EqlWrWLFiBddee20JtmzHcQiYWbssX76cmpoa1q5dy8CBA1ttN2nSJNasWcOECROoq6tj5MiRfPjhhyxcuJB7772XmpoaRowYwebNm/nzn/8MwPDhw6mqqqJTp07U1NSwcePGPfp97rnnWL16NSeeeCI1NTX84he/4C9/+Uvz+nPOOQeAz3/+883vf/zxx7n88sub2xxyyCGt9vPJT36Srl27cskllzB//vwDfgiqbM4JmFnHamho4KKLLqKxsZHevXuzbds2IoKamhqeffZZsk0ff8QRRzB58mQmT57MoEGDePHFF4kI7rrrLs4444zd2tbV1fGJT3yiebmiooJsk1FGBKeddhrz5s3LWueuPjLfHxF7XJrZVj8vvPACTzzxBPfffz933303Tz75ZI6tU758JGBmeampqaGhoYF+/fqxevVqRo8ezWOPPUZDQ0PWAHj00Uf56KOPAPjb3/7G5s2b6dOnD2eccQb33HNP87p169blnHq+Z8+evPfeewCMHDmSJUuWsH79egC2bdvGunXr2nz/6aefzt133928/M4777Taz9atW3n33Xc588wzmT59Og0NDXluofLkIwGzMtWeSzoL5c033+SQQw6hU6dOOYeDFi5cyFVXXUXXrqkHEd56660cdthhXHLJJWzcuJFhw4YREXzqU5/iN7/5TZufO2XKFMaOHcvhhx/OokWLmDt3LhMnTuTDD1OTF9x0003069ev1fdfe+21XH755QwaNIiKigquv/56zjnnnKz99OzZk3HjxrF9+3YigjvuuKO9m6msFOUZwz169IhCP1TGl4ha0q1Zs4Zjjjmmo8uwAsj2/1LStojoUepaPBxkZpZgDgEzswRzCJiZJZhDwMwswRwCZmZlQtIcSW9IerGV9ZJ0p6T1klZIGparT4eAmVn5mAuMaWP9WODo9NcU4J5cHfo+AbNydUOvAveX/yXOuaaSvvnmm/n1r38NpGYP3TV53OTJk7nyyit3azt37lzq6+t3u5mrWD744APGjBnDk08+SUVFxR7rTzjhBP7whz+02Ud1dTX19fX07t17t9fr6uro0qULJ5xwAgB33303PXr0YNKkSQWrPyKellTdRpNxwL2Ruvb/OUkHSzo8Il5v7Q05jwQk9ZfUkPH1D0nfanf1ZnbAeP755xkxYgRPPfUUJ5100h7rf/CDH9DQ0NB8N/Gu71sGQKnNmTOHc845Z48A2LlzJ0DOAGhLXV3dbu+fPHkyd955Z3u6qJRUn/E1ZS/K6AO8lrHcmH6tVTlDICJeioiaiKgBPg9sAx7ai+LMrMy1ZyrplrZv3948nfPQoUNZtGjRHm0eeeQRjj/+eN566612Txf91FNPUVNTQ01NDUOHDm2eZiLTfffd1/x8g7q6OkaNGsV5553XfKRy0EEHAfDxxx9z2WWX8bnPfY6zzz6bM888kwceeKC5n7vuumu3z9+4cSMzZ87kjjvuoKamhsWLF9O9e3eqq6t54YUX8t28TRFRm/G1N0/gyfbsyjbvCG7vOYEvAi9HxF9ytjSzA057ppJuacaMGUBqeGjevHl87Wtf2+2B6w899BC33HILCxYsAGj3dNG33XYbM2bMoKGhgcWLF+8xn9GOHTvYsGED1dXVza+98MIL3HzzzaxevXq3tvPnz2fjxo2sXLmS2bNn8+yzz+62vuXnV1dXM3XqVK6++moaGhqaj45qa2tZvHhxzm1TQI3AkRnLVcCmtt7Q3hA4F8g6dZ+kKbsOY7LN/GdmB4Z8p5Ju6ZlnnuGCCy4AYMCAARx11FHNE78tWrSIH//4xzzyyCNtTvO8S7bpok888US+/e1vc+edd7JlyxYq
K3c/5fnWW29x8MEH7/ba8OHD6du3b9ZaJ0yYQKdOnTjssMOaH2bT1udn8+lPf5pNm9rcBxfaw8CF6auERgLvtnU+ANpxYlhSF+DLwPeyrU8fusyC1NxBeZdsZmVhb6aSztTWPGWf+cxn2LBhA+vWraO2tnavpoueNm0aZ511FgsWLGDkyJE8/vjjDBgwoPk93bp12+3IA6BHj+xT9eSaUy3b52ezffv2nNulPSTNA04BektqBK4HOgNExExgAXAmsJ7U0H3Os9LtORIYCyyLiL+3r2wzOxC0dyrplk4++WTuu+8+IDV99Kuvvkr//v0BOOqoo5g/fz4XXnghq1at2qvpol9++WUGDx7Md7/7XWpra5vPFexyyCGHsHPnzj2CIJsvfOELPPjgg3z88cf8/e9/p66uLud7Mqe73mXdunUMGjQo53vzFRETI+LwiOgcEVUR8fOImJkOACLl8oj4bEQMjoj6XH225xLRibQyFGRmHaADZq1tz1TSLV122WVMnTqVwYMHU1lZydy5c3d7iEz//v257777mDBhAr/73e/aPV309OnTWbRoERUVFQwcOJCxY8fu0eb000/nmWee4dRTT22z1q985Ss88cQTDBo0iH79+jFixAh69Wr7ktwvfelLjB8/nt/+9rfcddddnHTSSSxZsoTrr7++zfd1tLymkpbUndRlR5+JiJz/8jyVtFnheSrpfbd8+XJuv/12fvnLX+Zsu3XrVg466CA2b97M8OHDWbJkCYcddlhBPmt/mko6ryOBiNgGHFrkWszMimro0KGMGjWKnTt3Zr1ZLNPZZ5/Nli1b2LFjBz/84Q/bFQCQOhF944037ku5JeE7hs0sUSZPnpxXu3zOA7TltNNO26f3l4rnDjIrI8V4EqCV1v72/9AhYFYmunbtyubNm/e7nYjlLyLYvHlz83OX9wceDjIrE1VVVTQ2NvLmm292dCm2D7p27UpVVVVHl9HMIWBWJjp37pz17lazfeHhIDOzBHMImJklmEPAzCzBHAJmZgnmEDAzSzCHgJlZgjkEzMwSzPcJWGHc0PY0u3vXp2dlNSs2HwmYmSWYQ8DMLME8HJRAhX5AD8DG/Wc+LDNrB4eA2f7I51isRDwcZGaWYHmFgKSDJT0gaa2kNZKOL3ZhZmZWfPkOB/0H8GhEjJfUBehexJrMzKxEcoaApE8CJwMXAUTEDmBHccsyM7NSyGc46DPAm8B/SlouabakHi0bSZoiqV5SfVNTU8ELNTOzwssnBCqBYcA9ETEUeB+Y1rJRRMyKiNqIqK2s9EVHZmblIJ8QaAQaI+L59PIDpELBzMzKXM4QiIi/Aa9J6p9+6YvA6qJWZWZmJZHvuM03gfvSVwZtACYVryQzMyuVvO4TiIiG9Hj/kIj4l4h4p9iFmZnZ7iSNkfSSpPWS9jg3K6mXpN9J+pOkVZJy/sHuO4bNzMqApApgBjAWGAhMlDSwRbPLgdURcSxwCvDv6RGcVjkEzMzKw3BgfURsSN+vdT8wrkWbAHpKEnAQ8DbQ5jX7DgEzs/1D5a57rdJfU1qs7wO8lrHcmH4t093AMcAmYCVwVUR83OaH7mPRZmZWGE0RUdvGemV5LVosnwE0AKOBzwK/l7Q4Iv7RWqc+EjAzKw+NwJEZy1Wk/uLPNAmYHynrgVeAAW116hAwMysPS4GjJfVNn+w9F3i4RZtXSd3LhaR/AvqTuqy/VR4OMjMrAxHRJOkK4DGgApgTEaskTU2vnwncCMyVtJLU8NF3I+Kttvp1CJiZlYmIWAAsaPHazIzvNwGnt6dPDweZmSWYQ8DMLME8HGS2j6qnPVLwPjd2LXiXZln5SMDMLMEcAmZmCeYQMDNLMIeAmVmCOQTMzBLMIWBmlmAOATOzBMvrPgFJG4H3gJ3knu7UzMzKRHtuFhuVayIiMzMrLx4OMjNLsHxDIICFkv6Y5ZFnAEiasuuxaE1NbT7S0szM9hP5DgedGBGbJH2a1OPK1kbE05kNImIW
MAugR48eLR95ZmZm+6G8jgTSc1QTEW8AD5F66r2ZmZW5nCEgqYeknru+J/XAgheLXZiZmRVfPsNB/wQ8JGlX+/+KiEeLWpWZmZVEzhCIiA3AsSWoxczMSsyXiJqZJZhDwMwswRwCZmYJ5hAwM0swh4CZWYI5BMzMEswhYGaWYA4BM7MEcwiYmSWYQ8DMLMEcAmZmZULSGEkvSVovaVorbU6R1CBplaSncvXZnsdLmplZB5FUAcwATgMagaWSHo6I1RltDgZ+AoyJiFfTz4Bpk48EzMzKw3BgfURsiIgdwP3AuBZtzgPmR8Sr0PwMmDY5BMzMykMf4LWM5cb0a5n6AYdIqks/DvjCXJ16OMjMbP9QKak+Y3lW+rG9uyjLe1o+yrcS+DzwRaAb8Kyk5yJiXasfurfVmplZQTVFRG0b6xuBIzOWq4BNWdq8FRHvA+9LeprU82BaDQEPB5mZlYelwNGS+krqApwLPNyizW+BkyRVSuoOjADWtNWpjwTMzMpARDRJugJ4DKgA5kTEKklT0+tnRsQaSY8CK4CPgdkR0eYz4fMOgfTlSfXAXyPi7L39QczMbO9ExAJgQYvXZrZYvhW4Nd8+2zMcdBU5DivMzKy85BUCkqqAs4DZxS3HzMxKKd8jgenA/yU1xpSVpCmS6iXVNzU1FaQ4MzMrrpwhIOls4I2I+GNb7SJiVkTURkRtZaXPN5uZlYN8jgROBL4saSOp25RHS/p/Ra3KzMxKImcIRMT3IqIqIqpJXZf6ZER8teiVmZlZ0flmMTOzBGvX4H1E1AF1RanEzMxKzkcCZmYJ5hAwM0swh4CZWYI5BMzMEswhYGaWYA4BM7MEcwiYmSWYQ8DMLMEcAmZmCeYQMDNLMIeAmVmCOQTMzBLMIWBmlmAOATOzBHMImJklmEPAzCzBHAJmZgnmEDAzS7CcISCpq6QXJP1J0ipJ/1aKwszMrPjyecbwh8DoiNgqqTPwjKT/iYjnilybmZkVWc4QiIgAtqYXO6e/ophFmZlZaeR1TkBShaQG4A3g9xHxfJY2UyTVS6pvamoqdJ1mZlYEeYVAROyMiBqgChguaVCWNrMiojYiaisr8xllMjOz9pA0RtJLktZLmtZGu+Mk7ZQ0Plef7bo6KCK2AHXAmPa8z8zM9o2kCmAGMBYYCEyUNLCVdj8GHsun33yuDvqUpIPT33cDTgXW5l+6mZkVwHBgfURsiIgdwP3AuCztvgk8SGr4Pqd8xm0OB36RTpdOwK8i4r/zq9nMzPJUKak+Y3lWRMzKWO4DvJax3AiMyOxAUh/gfwOjgePy+tBcDSJiBTA0n87MzGyvNUVEbRvrleW1lldqTge+GxE7pWzN9+QzuGZm5aERODJjuQrY1KJNLXB/OgB6A2dKaoqI37TWqUPAzKw8LAWOltQX+CtwLnBeZoOI6Lvre0lzgf9uKwDAIWBmVhYioknSFaSu+qkA5kTEKklT0+tn7k2/DgEzszIREQuABS1ey7rzj4iL8unTs4iamSWYQ8DMLMEcAmZmCeYQMDNLMIeAmVmCOQTMzBLMIWBmlmAOATOzBHMImJklmEPAzCzBHAJmZgnmEDAzSzCHgJlZgjkEzMwSLJ8HzR8paZGkNZJWSbqqFIWZmVnx5fM8gSbgXyNimaSewB8l/T4iVhe5NjMzK7KcRwIR8XpELEt//x6whtRT783MrMy168likqqBocDzWdZNAaYAdOnSpQClmZlZseV9YljSQcCDwLci4h8t10fErIiojYjayko/tdLMrBzkFQKSOpMKgPsiYn5xSzIzs1LJ5+ogAT8H1kTE7cUvyczMSiWfI4ETgQuA0ZIa0l9nFrkuMzMrgZyD9xHxDKAS1GJmZiXmO4bNzBLMIWBmlmAOATOzBHMImJklmEPAzCzBHAJmZgnmEDAzKxOSxkh6SdJ6SdOyrD9f0or01x8kHZurT4eAmVkZkFQBzADGAgOBiZIGtmj2CvC/ImIIcCMwK1e/DgEzs/Iw
HFgfERsiYgdwPzAus0FE/CEi3kkvPgdU5erUIWBmtn+olFSf8TWlxfo+wGsZy420/WyXi4H/yfmh7a/TzMyKoCkiattYn236nsjaUBpFKgS+kOtDHQJmZuWhETgyY7kK2NSykaQhwGxgbERsztWph4PMzMrDUuBoSX0ldQHOBR7ObCDpn4H5wAURsS6fTn0kYGZWBiKiSdIVwGNABTAnIlZJmppePxO4DjgU+EnqUTA5h5gcAmZm5SIiFgALWrw2M+P7S4BL2tOnh4PMzBLMIWBmlmAOATOzBMvnQfNzJL0h6cVSFGRmZqWTz5HAXGBMkeswM7MOkDMEIuJp4O0S1GJmZiVWsEtE0/NcTAHo0qVLobo1M7MiKtiJ4YiYFRG1EVFbWenbD8zMyoGvDjIzSzCHgJlZguVzieg84Fmgv6RGSRcXvywzMyuFnIP3ETGxFIWYmVnpeTjIzCzBHAJmZgnmEDAzSzCHgJlZgjkEzMwSzCFgZpZgDgEzswTzJD9mtl+pnvZIwfvceMtZBe/zQOEQMLMD3w29Ctzfu4XtrwN5OMjMLMEcAmZmCeYQMDNLMIeAmVmCOQTMzBLMIWBmlmAOATOzBHMImJklmEPAzCzB8goBSWMkvSRpvaRpxS7KzMz2lGtfrJQ70+tXSBqWq898HjRfAcwAxgIDgYmSBu7ND2BmZnsnz33xWODo9NcU4J5c/eZzJDAcWB8RGyJiB3A/MK4dtZuZ2b7LZ188Drg3Up4DDpZ0eFud5jOBXB/gtYzlRmBEy0aSppBKHoCQ9EEefXcYpX72poJ2+m8qaHflxNuzsLw9C6vg27M427KbpPqM5VkRMStjOZ99cbY2fYDXW/vQfEIg208be7yQKnZWlrb7JUn1EVHb0XUcKLw9C8vbs7AOkO2Zz744r/11pnyGgxqBIzOWq4BNebzPzMwKJ599cbv31/mEwFLgaEl9JXUBzgUezuN9ZmZWOPnsix8GLkxfJTQSeDciWh0KgjyGgyKiSdIVwGNABTAnIlbt1Y+wfymboasy4e1ZWN6ehVX227O1fbGkqen1M4EFwJnAemAbMClXv4poc7jIzMwOYL5j2MwswRwCZmYJlrgQkNRfUkPG1z8kfauj6ypXkuZIekPSix1dS7nyNiwc/363X6LPCaRvw/4rMCIi/tLR9ZQjSScDW0ndpTioo+spR96GxeHf7/wk7kighS8CL/sfyN6LiKeBtzu6jnKWbRtK+rqkpZL+JOlBSd07qLxy1vz77e3ZuqSHwLnAvI4uwiyL+RFxXEQcC6wBLu7ogspQ5u+3t2crEhsC6Zstvgz8uqNrMctikKTFklYC5wOf6+iCykmW329vz1YkNgRITbm6LCL+3tGFmGUxF7giIgYD/wZ07dhyyk7L3++5eHtmleQQmIiHgmz/1RN4XVJnUn+5Wvu0/P329mxFIkMgfVLoNGB+R9dS7iTNA54F+ktqlOSx1nZqZRv+EHge+D2wtiPrKzet/H57e7Yi0ZeImpklXSKPBMzMLMUhYGaWYA4BM7MEcwiYmSWYQ8DMLMEcAnZAkLQ1/d9qSed1dD1m5cIhYAeaasAhYJYnh4AdaG4BTkrPJX+1pApJt6ZnkFwh6RsAkk6R9JSkX0laJ+kWSedLekHSSkmf7eCfw6wkcj5o3qzMTAO+ExFnA0iaArwbEcdJ+gSwRNLCdNtjgWNITeO8AZgdEcMlXQV8E/DDSOyA5xCwA93pwBBJ49PLvYCjgR3A0oh4HUDSy8CucFgJjCp1oWYdwSFgBzoB34yIx3Z7UToF+DDjpY8zlj/GvxuWED4nYAea90jNGLnLY8Cl6dkjkdRPUo8OqcxsP+S/duxAswJokvQnUnPI/wepK4aWSRLwJvAvHVad2X7Gs4iamSWYh4PMzBLMIWBmlmAOATOzBHMImJklmEPAzCzBHAJmZgnmEDAzS7D/Dw3ty03qXBgKAAAAAElFTkSuQmCC\n",
      "text/plain": [
       "<Figure size 432x288 with 2 Axes>"
      ]
     },
     "metadata": {
      "needs_background": "light"
     },
     "output_type": "display_data"
    }
   ],
   "source": [
    "nsents.set_index('Item').join(ntokens.set_index('Item')).plot.bar(secondary_y='# Tokens', rot=0);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:47:22.542194Z",
     "start_time": "2020-06-21T16:47:22.489506Z"
    }
   },
   "outputs": [],
   "source": [
    "ngrams = pd.read_parquet(sec_path / 'ngrams.parquet')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:47:22.543489Z",
     "start_time": "2020-06-21T16:45:19.211Z"
    }
   },
   "outputs": [],
   "source": [
    "ngrams.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:47:22.544028Z",
     "start_time": "2020-06-21T16:45:19.516Z"
    }
   },
   "outputs": [],
   "source": [
    "ngrams.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:47:22.544559Z",
     "start_time": "2020-06-21T16:45:20.162Z"
    }
   },
   "outputs": [],
   "source": [
    "ngrams.score.describe(percentiles=percentiles)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:47:22.545201Z",
     "start_time": "2020-06-21T16:45:23.624Z"
    }
   },
   "outputs": [],
   "source": [
    "ngrams[ngrams.score>.7].sort_values(['length', 'score']).head(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:48:08.840933Z",
     "start_time": "2020-06-21T16:48:08.755568Z"
    }
   },
   "outputs": [],
   "source": [
    "vocab = pd.read_csv(stats_path / 'sections_vocab.csv').dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:48:09.531070Z",
     "start_time": "2020-06-21T16:48:09.500574Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 200867 entries, 0 to 200868\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count   Dtype \n",
      "---  ------  --------------   ----- \n",
      " 0   token   200867 non-null  object\n",
      " 1   n       200867 non-null  int64 \n",
      "dtypes: int64(1), object(1)\n",
      "memory usage: 4.6+ MB\n"
     ]
    }
   ],
   "source": [
    "vocab.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:48:29.793631Z",
     "start_time": "2020-06-21T16:48:29.762029Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count     200867\n",
       "mean        1439\n",
       "std        22312\n",
       "min            1\n",
       "10%            1\n",
       "20%            2\n",
       "30%            3\n",
       "40%            4\n",
       "50%            7\n",
       "60%           12\n",
       "70%           24\n",
       "80%           61\n",
       "90%          260\n",
       "max      2574572\n",
       "Name: n, dtype: int64"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "vocab.n.describe(percentiles).astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:31.376347Z",
     "start_time": "2020-06-21T16:48:40.664233Z"
    }
   },
   "outputs": [],
   "source": [
    "# count token frequencies in the n-gram corpus file\n",
    "tokens = Counter()\n",
    "with (ngram_path / 'ngrams_2.txt').open() as f:\n",
    "    for line in f:\n",
    "        tokens.update(line.split())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:31.539973Z",
     "start_time": "2020-06-21T16:49:31.377290Z"
    }
   },
   "outputs": [],
   "source": [
    "tokens = pd.DataFrame(tokens.most_common(),\n",
    "                     columns=['token', 'count'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:31.564413Z",
     "start_time": "2020-06-21T16:49:31.541492Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 230112 entries, 0 to 230111\n",
      "Data columns (total 2 columns):\n",
      " #   Column  Non-Null Count   Dtype \n",
      "---  ------  --------------   ----- \n",
      " 0   token   230112 non-null  object\n",
      " 1   count   230112 non-null  int64 \n",
      "dtypes: int64(1), object(1)\n",
      "memory usage: 3.5+ MB\n"
     ]
    }
   ],
   "source": [
    "tokens.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:31.576141Z",
     "start_time": "2020-06-21T16:49:31.565627Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>million</td>\n",
       "      <td>2340187</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>business</td>\n",
       "      <td>1696732</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>december</td>\n",
       "      <td>1512367</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>company</td>\n",
       "      <td>1490617</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>products</td>\n",
       "      <td>1367413</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "      token    count\n",
       "0   million  2340187\n",
       "1  business  1696732\n",
       "2  december  1512367\n",
       "3   company  1490617\n",
       "4  products  1367413"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokens.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:31.713773Z",
     "start_time": "2020-06-21T16:49:31.576981Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count     29951\n",
       "mean        926\n",
       "std        9611\n",
       "min           1\n",
       "10%          26\n",
       "20%          31\n",
       "30%          37\n",
       "40%          46\n",
       "50%          61\n",
       "60%          85\n",
       "70%         131\n",
       "80%         237\n",
       "90%         666\n",
       "max      593859\n",
       "Name: count, dtype: int64"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokens.loc[tokens.token.str.contains('_'), 'count'].describe(percentiles).astype(int)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:31.814099Z",
     "start_time": "2020-06-21T16:49:31.714671Z"
    }
   },
   "outputs": [],
   "source": [
    "tokens[tokens.token.str.contains('_')].head(20).to_csv(sec_path / 'ngram_examples.csv', index=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:31.906003Z",
     "start_time": "2020-06-21T16:49:31.815020Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>token</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>46</th>\n",
       "      <td>year_ended</td>\n",
       "      <td>593859</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>64</th>\n",
       "      <td>results_operations</td>\n",
       "      <td>492047</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>71</th>\n",
       "      <td>table_contents</td>\n",
       "      <td>436034</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>78</th>\n",
       "      <td>company_s</td>\n",
       "      <td>412971</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>85</th>\n",
       "      <td>financial_condition</td>\n",
       "      <td>396164</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>86</th>\n",
       "      <td>common_stock</td>\n",
       "      <td>387629</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>107</th>\n",
       "      <td>fair_value</td>\n",
       "      <td>341108</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>152</th>\n",
       "      <td>united_states</td>\n",
       "      <td>276401</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>158</th>\n",
       "      <td>cash_flows</td>\n",
       "      <td>266725</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>168</th>\n",
       "      <td>financial_statements</td>\n",
       "      <td>255115</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>187</th>\n",
       "      <td>interest_rate</td>\n",
       "      <td>234621</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>188</th>\n",
       "      <td>approximately_million</td>\n",
       "      <td>234385</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>199</th>\n",
       "      <td>adversely_affect</td>\n",
       "      <td>227984</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>223</th>\n",
       "      <td>long_term</td>\n",
       "      <td>203600</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>238</th>\n",
       "      <td>real_estate</td>\n",
       "      <td>192824</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>239</th>\n",
       "      <td>material_adverse</td>\n",
       "      <td>192238</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>240</th>\n",
       "      <td>fiscal_year</td>\n",
       "      <td>192189</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>243</th>\n",
       "      <td>interest_rates</td>\n",
       "      <td>190754</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>248</th>\n",
       "      <td>income_tax</td>\n",
       "      <td>186923</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>267</th>\n",
       "      <td>natural_gas</td>\n",
       "      <td>178765</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                     token   count\n",
       "46              year_ended  593859\n",
       "64      results_operations  492047\n",
       "71          table_contents  436034\n",
       "78               company_s  412971\n",
       "85     financial_condition  396164\n",
       "86            common_stock  387629\n",
       "107             fair_value  341108\n",
       "152          united_states  276401\n",
       "158             cash_flows  266725\n",
       "168   financial_statements  255115\n",
       "187          interest_rate  234621\n",
       "188  approximately_million  234385\n",
       "199       adversely_affect  227984\n",
       "223              long_term  203600\n",
       "238            real_estate  192824\n",
       "239       material_adverse  192238\n",
       "240            fiscal_year  192189\n",
       "243         interest_rates  190754\n",
       "248             income_tax  186923\n",
       "267            natural_gas  178765"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tokens[tokens.token.str.contains('_')].head(20)"
   ]
  },
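  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Get returns\n",
    "\n",
    "To label the filings for predictive modeling, we compute the one-month return following each filing date from adjusted close prices."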
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:43.419751Z",
     "start_time": "2020-06-21T16:49:43.410341Z"
    }
   },
   "outputs": [],
   "source": [
    "DATA_FOLDER = Path('..', 'data')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:46.315225Z",
     "start_time": "2020-06-21T16:49:43.586015Z"
    }
   },
   "outputs": [],
   "source": [
    "with pd.HDFStore(DATA_FOLDER / 'assets.h5') as store:\n",
    "    prices = store['quandl/wiki/prices'].adj_close"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:46.366663Z",
     "start_time": "2020-06-21T16:49:46.316232Z"
    }
   },
   "outputs": [],
   "source": [
    "sec = pd.read_csv(sec_path / 'filing_index.csv').rename(columns=str.lower)\n",
    "sec.date_filed = pd.to_datetime(sec.date_filed)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:46.741667Z",
     "start_time": "2020-06-21T16:49:46.721940Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "RangeIndex: 22631 entries, 0 to 22630\n",
      "Data columns (total 11 columns):\n",
      " #   Column        Non-Null Count  Dtype         \n",
      "---  ------        --------------  -----         \n",
      " 0   cik           22631 non-null  int64         \n",
      " 1   company_name  22631 non-null  object        \n",
      " 2   form_type     22631 non-null  object        \n",
      " 3   date_filed    22631 non-null  datetime64[ns]\n",
      " 4   edgar_link    22631 non-null  object        \n",
      " 5   quarter       22631 non-null  int64         \n",
      " 6   ticker        22631 non-null  object        \n",
      " 7   sic           22461 non-null  object        \n",
      " 8   exchange      20619 non-null  object        \n",
      " 9   hits          22555 non-null  object        \n",
      " 10  year          22631 non-null  int64         \n",
      "dtypes: datetime64[ns](1), int64(3), object(7)\n",
      "memory usage: 1.9+ MB\n"
     ]
    }
   ],
   "source": [
    "sec.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:49.396776Z",
     "start_time": "2020-06-21T16:49:49.390774Z"
    }
   },
   "outputs": [],
   "source": [
    "idx = pd.IndexSlice"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:51.423980Z",
     "start_time": "2020-06-21T16:49:50.352212Z"
    }
   },
   "outputs": [],
   "source": [
    "# pad the filing date range by one month on each side\n",
    "first = sec.date_filed.min() + relativedelta(months=-1)\n",
    "last = sec.date_filed.max() + relativedelta(months=1)\n",
    "# daily, forward-filled prices (dates x tickers) for tickers with filings\n",
    "prices = (prices\n",
    "          .loc[idx[first:last, :]]\n",
    "          .unstack().resample('D')\n",
    "          .ffill()\n",
    "          .dropna(how='all', axis=1)\n",
    "          .filter(sec.ticker.unique()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:53.694685Z",
     "start_time": "2020-06-21T16:49:51.425034Z"
    }
   },
   "outputs": [],
   "source": [
    "# keep only filings for tickers with price data\n",
    "sec = sec.loc[sec.ticker.isin(prices.columns), ['ticker', 'date_filed']]\n",
    "\n",
    "# return from each filing date to one month later\n",
    "price_data = []\n",
    "for ticker, date in sec.values.tolist():\n",
    "    target = date + relativedelta(months=1)\n",
    "    s = prices.loc[date: target, ticker]\n",
    "    price_data.append(s.iloc[-1] / s.iloc[0] - 1)\n",
    "\n",
    "df = pd.DataFrame(price_data,\n",
    "                  columns=['returns'],\n",
    "                  index=sec.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:53.700732Z",
     "start_time": "2020-06-21T16:49:53.695697Z"
    }
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "count    11101.000000\n",
       "mean         0.022839\n",
       "std          0.126137\n",
       "min         -0.555556\n",
       "25%         -0.032213\n",
       "50%          0.017349\n",
       "75%          0.067330\n",
       "max          1.928826\n",
       "Name: returns, dtype: float64"
      ]
     },
     "execution_count": 34,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.returns.describe()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:53.718447Z",
     "start_time": "2020-06-21T16:49:53.701618Z"
    }
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<class 'pandas.core.frame.DataFrame'>\n",
      "Int64Index: 11375 entries, 0 to 22629\n",
      "Data columns (total 3 columns):\n",
      " #   Column      Non-Null Count  Dtype         \n",
      "---  ------      --------------  -----         \n",
      " 0   ticker      11375 non-null  object        \n",
      " 1   date_filed  11375 non-null  datetime64[ns]\n",
      " 2   returns     11101 non-null  float64       \n",
      "dtypes: datetime64[ns](1), float64(1), object(1)\n",
      "memory usage: 355.5+ KB\n"
     ]
    }
   ],
   "source": [
    "sec['returns'] = price_data\n",
    "sec.info()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 36,
   "metadata": {
    "ExecuteTime": {
     "end_time": "2020-06-21T16:49:58.385354Z",
     "start_time": "2020-06-21T16:49:58.326570Z"
    }
   },
   "outputs": [],
   "source": [
    "sec.dropna().to_csv(sec_path / 'sec_returns.csv', index=False)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": true,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": true,
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
